Random Notes:

GPU Programming

As a rule of thumb, launch about 4x as many threads as the GPU can run simultaneously. GPUs switch between threads at almost no cost, so the extra threads give the scheduler something to run while other threads wait on memory.
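
A minimal CUDA sketch of that rule of thumb. It assumes "threads that can run simultaneously" means the device's resident-thread capacity (SM count times max threads per SM), which is one reading of the note and not the only one. The names `scale`, `oversubscribedBlocks`, and `launchScale` are made up for illustration, and the block size of 256 is an arbitrary choice.

```cuda
#include <cuda_runtime.h>

// Grid-stride kernel: each thread walks the array in steps of the total
// thread count, so the result is correct for any grid size.
__global__ void scale(float *data, float factor, int n) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= factor;
    }
}

// Host-side arithmetic only: number of blocks needed to launch `factor`
// times the resident-thread capacity (assumed here to be
// smCount * maxThreadsPerSm), rounded up to whole blocks.
int oversubscribedBlocks(int smCount, int maxThreadsPerSm, int blockSize, int factor) {
    int residentThreads = smCount * maxThreadsPerSm;
    int targetThreads = residentThreads * factor;
    return (targetThreads + blockSize - 1) / blockSize;
}

// Launches `scale` on a device buffer with roughly 4x oversubscription,
// capped so that small inputs do not launch blocks with nothing to do.
cudaError_t launchScale(float *dData, float factor, int n) {
    if (n <= 0) return cudaSuccess;

    int device = 0;
    cudaError_t err = cudaGetDevice(&device);
    if (err != cudaSuccess) return err;

    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) return err;

    const int blockSize = 256;  // arbitrary choice for this sketch
    int blocks = oversubscribedBlocks(prop.multiProcessorCount,
                                      prop.maxThreadsPerMultiProcessor,
                                      blockSize, 4);
    int needed = (n + blockSize - 1) / blockSize;
    if (blocks > needed) blocks = needed;

    scale<<<blocks, blockSize>>>(dData, factor, n);
    return cudaGetLastError();
}
```

The grid-stride loop is what makes the launch size a tuning knob: the kernel stays correct whether the grid is 1x or 4x the device capacity, so the factor can be adjusted by measurement.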

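The L2 cache (and L3, where present) is managed by the hardware. At the L1 level there is also shared memory, a scratchpad that the programmer fills and reads manually; the L1 cache itself is still hardware-managed, and on many NVIDIA architectures it shares its on-chip storage with shared memory.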
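
A small CUDA sketch of manually managed on-chip memory: each block copies its slice of the input into a `__shared__` tile, then reduces the tile to one partial sum. The names `blockSum`, `blocksFor`, and `sumOnGpu` are hypothetical, and the fixed block size of 256 is an assumption of this example (the reduction loop needs a power of two).

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 256;  // power of two, required by the reduction loop

// Each block stages its elements in shared memory, then halves the active
// range until tile[0] holds the block's sum. Must be launched with
// blockDim.x == kBlockSize.
__global__ void blockSum(const float *in, float *blockTotals, int n) {
    __shared__ float tile[kBlockSize];  // programmer-managed scratchpad

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    tile[tid] = (i < n) ? in[i] : 0.0f;  // explicit load into shared memory
    __syncthreads();                     // tile is complete before anyone reads it

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockTotals[blockIdx.x] = tile[0];
}

// Number of blocks needed to cover n elements.
int blocksFor(int n) {
    return (n + kBlockSize - 1) / kBlockSize;
}

// Sums n host floats on the GPU; per-block totals are added on the host.
cudaError_t sumOnGpu(const float *hIn, int n, float *result) {
    *result = 0.0f;
    if (n <= 0) return cudaSuccess;

    int blocks = blocksFor(n);
    float *dIn = nullptr;
    float *dTotals = nullptr;

    cudaError_t err = cudaMalloc(&dIn, n * sizeof(float));
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&dTotals, blocks * sizeof(float));
    if (err != cudaSuccess) { cudaFree(dIn); return err; }

    err = cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        blockSum<<<blocks, kBlockSize>>>(dIn, dTotals, n);
        err = cudaGetLastError();
    }

    std::vector<float> totals(blocks, 0.0f);
    if (err == cudaSuccess) {
        // cudaMemcpy waits for the kernel to finish before copying.
        err = cudaMemcpy(totals.data(), dTotals, blocks * sizeof(float),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(dIn);
    cudaFree(dTotals);

    if (err == cudaSuccess) {
        for (float t : totals) *result += t;
    }
    return err;
}
```

Nothing here touches the L1 or L2 caches directly: the global-memory reads go through them automatically, while the `tile` array is placed and synchronized entirely by the code.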
